Port SitemapExtractor from CC-MRJob to CC-PySpark #54
Conversation
Signed-off-by: Damian Stewart <[email protected]>
Signed-off-by: Damian Stewart <[email protected]>
Signed-off-by: Damian Stewart <[email protected]>
First pass. I'll continue with testing, but several points need discussion.
Signed-off-by: Damian Stewart <[email protected]>
Signed-off-by: Damian Stewart <[email protected]>
Signed-off-by: Damian Stewart <[email protected]>
Signed-off-by: Damian Stewart <[email protected]>
Signed-off-by: Damian Stewart <[email protected]>
Signed-off-by: Damian Stewart <[email protected]>
Signed-off-by: Damian Stewart <[email protected]>
Great! Some minor things remain to do.
Tested on my local machine:
- successfully ran the unit tests using both the pyspark module and an installed Spark. For the latter, it's required to set PYTHONPATH=$PWD/test:$(ls $SPARK_HOME/python/lib/py4j-*-src.zip):$SPARK_HOME/python:$PYTHONPATH (see the sketch after this list)
- successfully tested sitemaps_from_robotstxt.py. The output looks good: on a small test sample, there are no differences in the number of extracted sitemap URLs compared to the cc-mrjob implementation
- failed to run sitemaps_from_robotstxt_fastwarc.py
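For reference, a minimal sketch of the two local test setups described above; the $SPARK_HOME path and the py4j zip file name are assumptions that depend on the local Spark installation:

```bash
# Option 1: pyspark installed as a Python package
pip install pyspark pytest
PYTHONPATH=$PYTHONPATH:./test python -m pytest test -v

# Option 2: a locally installed Spark distribution instead of the pyspark package
export SPARK_HOME=/path/to/spark   # assumption: adjust to your installation
export PYTHONPATH=$PWD/test:$(ls $SPARK_HOME/python/lib/py4j-*-src.zip):$SPARK_HOME/python:$PYTHONPATH
python -m pytest test -v
```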
requirements.txt (outdated)

    orjson
    warcio

    # for validating URLs in robots.txt:
Not required anymore.
Fixed
sitemaps_from_robotstxt.py (outdated)

    from typing import Optional
    from urllib.parse import urlparse, urljoin

    import validators
Not required anymore.
Fixed
    def test_host_accumulation_same_host(spark):
        """
        Test accumulation of hosts when sitemap url host and robots.txt url host match
        Requires test/ on PYTHONPATH so utils._process_jobs can be imported
A 4-line function could be inlined to avoid the import error.
However, test/ is also required on the path to load test_sitemaps_from_robotstxt, which is needed for Spark serialization. So you need to run the tests like this:

    PYTHONPATH=$PYTHONPATH:./test python -m pytest test -v
README has been updated with this information
    robots_txt_with_more_than_50_sitemaps = None


    def init_accumulators(self, session):
log_accumulators also needs to be overridden, otherwise the class-specific accumulators are never shown and are not preserved once the job has finished. See cc_index_word_count.py or wat_extract_links.py.
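For illustration, a sketch of the override pattern used by the existing jobs; the log_accumulator helper and the exact counter names are assumptions based on this PR's code excerpts and log output:

```python
def init_accumulators(self, session):
    # keep the base class counters (input files and records processed, ...)
    super(SitemapExtractorJob, self).init_accumulators(session)
    sc = session.sparkContext
    # class-specific counters (names assumed from this PR)
    self.robots_txt_processed = sc.accumulator(0)
    self.sitemap_urls_found = sc.accumulator(0)
    self.robots_txt_announcing_sitemap = sc.accumulator(0)
    self.robots_txt_with_more_than_50_sitemaps = sc.accumulator(0)

def log_accumulators(self, session):
    # without this override only the base class counters are logged
    super(SitemapExtractorJob, self).log_accumulators(session)
    self.log_accumulator(session, self.robots_txt_processed,
                         'robots.txt successfully parsed = {}')
    self.log_accumulator(session, self.sitemap_urls_found,
                         'sitemap urls found = {}')
    self.log_accumulator(session, self.robots_txt_announcing_sitemap,
                         'robots.txt announcing at least 1 sitemap = {}')
    self.log_accumulator(session, self.robots_txt_with_more_than_50_sitemaps,
                         'robots.txt with more than 50 sitemaps = {}')
```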
done
    # process only WARC response and metadata (including WAT) records
    fastwarc_record_filter = WarcRecordType.response

    # process_record is implemented by SitemapExtractorJob
The "main" block is required in order to run the job.
Please also run the job to verify that there are no errors.
fixed
sitemaps_from_robotstxt.py (outdated)

    if robots_txt_url is None:
        # first sitemap found: set base URL and get host from URL
        robots_txt_url = record.rec_headers['WARC-Target-URI']
This is not compatible with FastWARC; it should be self.get_warc_header(record, 'WARC-Target-URI').
fixed
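For reference, the portable form of the excerpt above looks like this (a sketch; get_warc_header is the accessor suggested in the comment):

```python
if robots_txt_url is None:
    # first sitemap found: set base URL and get host from URL
    # get_warc_header works for both warcio and FastWARC record objects
    robots_txt_url = self.get_warc_header(record, 'WARC-Target-URI')
```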
Signed-off-by: Damian Stewart <[email protected]>
Excellent, @damian0815!
Unit tests pass, and both versions (warcio and fastwarc) run successfully locally.
I'll also run the job on a real cluster later today and merge the PR if that test passes as well.
Would you mind squashing the commits to a small and meaningful number? But I can do it when merging as well. Thanks!
I think it's easier/cleaner if you squash when merging?
I've successfully run sitemaps_from_robotstxt_fastwarc.py on 5% of the robots.txt captures of CC-MAIN-2025-43 using a single-node Hadoop cluster (Spark on YARN):
13:57:14.391 [Thread-5] INFO SitemapExtractorFastWarc - WARC/WAT/WET input files processed = 5000
13:57:14.394 [Thread-5] INFO SitemapExtractorFastWarc - WARC/WAT/WET input files failed = 0
13:57:14.396 [Thread-5] INFO SitemapExtractorFastWarc - WARC/WAT/WET records processed = 4444146
13:57:14.398 [Thread-5] INFO SitemapExtractorFastWarc - robots.txt successfully parsed = 4444146
13:57:14.401 [Thread-5] INFO SitemapExtractorFastWarc - sitemap urls found = 3949561
13:57:14.403 [Thread-5] INFO SitemapExtractorFastWarc - sitemap urls with invalid utf-8 encoding = 76
13:57:14.405 [Thread-5] INFO SitemapExtractorFastWarc - robots.txt announcing at least 1 sitemap = 1885795
13:57:14.408 [Thread-5] INFO SitemapExtractorFastWarc - robots.txt with more than 50 sitemaps = 3547
I'll merge the code. Thanks, @damian0815!
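For context, a sketch of the kind of spark-submit invocation used for such a run, following the pattern from the cc-pyspark README; the YARN option, input listing and output table name are placeholders:

```bash
# on YARN, helper modules such as sparkcc.py may need to be shipped via --py-files
$SPARK_HOME/bin/spark-submit \
    --master yarn \
    ./sitemaps_from_robotstxt_fastwarc.py \
    --num_output_partitions 10 --log_level INFO \
    ./input/robotstxt_warc_files.txt sitemaps_from_robotstxt
```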
Port cc-mrjob/sitemaps_from_robotstxt.py.
crawl-tools/server/seed/sitemaps/sitemaps_robotstxt.py